A variable accuracy indirect addressing scheme for SIMD multi-processors and apparatus
implementing same.



Described is a parallel processing architecture
following the Single Instruction stream Multiple Data
stream execution paradigm where a controller element
(18) is connected to at least one processing
element (10) with a local memory (38) having a local
memory address shift register (46) adapted to receive
and retain therein a globally broadcast memory
base register address value received from the
controller element (18) for use by the processing
element for access and transfer of data between the
processing element (10) and its respective local
memory (38). A computer architecture for implementing
indirect addressing and look-up tables includes
a processing element shift register (44) associated
with the at least one processing element (10)
and adapted to receive and retain therein a local
memory offset address value calculated or loaded
by the associated processing element (10) in accord
with a first predetermined set of instructions. The
processing element shift register (44) transfers its
contents bitwise to the local memory shift register
(46) of the local memory (38) associated with the
processing element (10), with the bit value of the
most significant bit position being sequentially
transferred to the least significant bit position of the
local memory shift register (46) in accord with a
second predetermined set of instructions.
BACKGROUND OF THE INVENTION

1. Field of the Invention

This invention relates in general to computer 
architectures, and, more particularly, to a method 
and an apparatus for enabling indirect addressing 
and lookup table implementation on Single Instruc- 
tion stream Multiple Data stream (SIMD) multi- 
processor architectures. 

2. Description of the Related Art 

In existing SIMD computer architectures, mem- 
ory is generally accessed by the processor array 
as a single plane of memory locations. 

In conventional SIMD architectures, the mem- 
ory address location is broadcast, along with the 
instruction word, to all processing elements by the 
controller. This configuration normally results in the 
processing elements accessing a single plane of 
data in memory. Offsets from this plane of data 
cannot be done using this architecture, as there is 
no provision for specifying, or modifying, the local 
memory address associated with each processor 
based on local data in each processor. 

As a consequence of this lock-step approach, it 
is especially difficult to implement efficiently in- 
direct addressing and look-up tables in a parallel 
processing architecture following the Single Instruc- 
tion stream Multiple Data stream execution para- 
digm. Indirect addressing requires serialization of 
operations and thus uses O(N) cycles to perform 
the memory access in an N processor system. 
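The cost difference described above can be illustrated with a rough Python model. This is a sketch only: the function names and the cost model of one cycle per broadcast address are assumptions made for illustration and are not part of the disclosure.

```python
def plane_access(memories, address):
    """One broadcast address: every PE reads the same location of its own memory."""
    values = [memory[address] for memory in memories]
    return values, 1  # a single broadcast cycle serves all PEs


def serialized_indirect_access(memories, base, offsets):
    """Each PE wants a different address, so the controller broadcasts one
    address per PE while only that PE is enabled (assumed cost: 1 cycle each)."""
    values = [None] * len(memories)
    cycles = 0
    for pe, offset in enumerate(offsets):
        address = base + offset          # address needed by this PE only
        values[pe] = memories[pe][address]
        cycles += 1                      # one broadcast per processing element
    return values, cycles


# Four PEs, each with an 8-word local memory holding distinguishable values.
memories = [[pe * 100 + a for a in range(8)] for pe in range(4)]
plane_values, plane_cycles = plane_access(memories, 3)
indirect_values, indirect_cycles = serialized_indirect_access(memories, 0, [1, 5, 2, 7])
```

In this model the plane access completes in one cycle regardless of array size, whereas the serialized indirect access needs one cycle per processing element, which is the O(N) behavior noted above.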

SUMMARY OF THE INVENTION 

Generally, the present invention is embodied in 
a method and computer architecture for implement- 
ing indirect addressing and look-up tables in a 
parallel processing architecture following the Single 
Instruction stream Multiple Data stream execution 
paradigm. 

SIMD architecture utilizes a controller element 
connected to an array of processing elements 
where each processing element has associated 
with it a local memory that has a local memory 
address shift register. The local memory address 
shift register is adapted to receive and retain a 
globally broadcast memory base register address 
value received from the host, or from the controller 
element, for use by the processing element in 
accessing and transferring data between the pro- 
cessing element and its respective local memory. 

Each of the processing elements is further associated
with a processing element shift register
that is adapted to receive and retain a local memory
offset address value calculated or loaded by
the processing element in accord with a first predetermined
set of instructions. The processing element
shift register is also further adapted to transfer
its contents bitwise to the local memory shift
register of its associated local memory, with the bit
value of the most significant bit position being
sequentially transferred to the least significant bit
position of the local memory shift register in accord
with a second predetermined set of instructions.

As an alternative, the processing element shift
register can also be adapted to transfer its contents 
to the local memory shift register of its associated 
local memory, in a parallel transfer as described 
more fully below. 

The description of the invention presented is
intended as a general guideline for the design and
incorporation of the invention into a specific
implementation. Therefore, specific details of the
design, such as clock rates, the number of bits in
each register, etc., are left to be determined based
on the implementation technology and the allotted
cost of the final product. In the following, the
details that are unique to the present invention
are elaborated.

The novel features of construction and operation
of the invention will be more clearly apparent
during the course of the following description, reference
being had to the accompanying drawings
wherein has been illustrated a preferred form of the
device of the invention and wherein like characters
of reference designate like parts throughout the
drawings.

BRIEF DESCRIPTION OF THE FIGURES 

FIG. 1 is an idealized block schematic diagram
illustrating the top level design of a computer
architecture embodying the present invention;
FIG. 2 is an idealized block schematic diagram
illustrating the top level design of the processing
elements forming the processor array in a computer
architecture similar to that of FIG. 1 embodying
the present invention;
FIG. 3 is an idealized block schematic diagram
illustrating the processor and memory level design
in a computer architecture similar to that of
FIG. 1 embodying the present invention.

DESCRIPTION OF THE PREFERRED EMBODIMENTS

With reference being made to the Figures, a
preferred embodiment of the present invention will
now be described in a method and an apparatus
for providing a platform for efficient implementation
of the computation associated with processing a
wide variety of neural networks.

The invention is embodied in a computer architecture
that can roughly be classified as a Single
Instruction stream Multiple Data stream
(SIMD) medium or fine grain parallel computer. The
top level architecture of such an embodiment is
depicted in FIG. 1 where each Processing Element
10 is arranged on a two dimensional processor
array lattice 12.

This architecture is most easily discussed in 
three major groupings of functional units: the host 
computer 16, the controller 18, and the Processor 
Array 20. 

The controller unit 18 interfaces to both the 
host computer 16 and to the Processor Array 20. 
The controller 18 contains a microprogram memory 
area 32 that can be accessed by the host 16. High
level programs can be written and compiled on the 
host 16 and the generated control information can 
be downloaded from the host 16 to the microprogram
memory 32 of the controller 18. The controller
18 broadcasts an instruction and possibly a
memory address to the Processor Array 20 during 
each processing cycle. The processors 10 in the 
Processor Array 20 perform operations received 
from the controller 18 based on a mask flag avail- 
able in each Processing Element 10. 
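The masked broadcast described above might be modeled as follows. This is a minimal sketch under stated assumptions: the class and function names are hypothetical, and the polarity of the mask flag (set means "execute") is assumed, since the text does not specify it.

```python
class ProcessingElement:
    """Hypothetical model of one PE: a single accumulator plus a mask flag."""

    def __init__(self, accumulator, mask_flag):
        self.accumulator = accumulator
        # Assumed polarity: True means this PE executes the broadcast instruction.
        self.mask_flag = mask_flag


def broadcast(array, operation):
    """Controller step: send the same operation to every PE in the array.
    Only PEs whose mask flag is set actually perform it."""
    for pe in array:
        if pe.mask_flag:
            pe.accumulator = operation(pe.accumulator)


# Four PEs; only those holding an even value are enabled.
array = [ProcessingElement(value, value % 2 == 0) for value in range(4)]
broadcast(array, lambda value: value + 10)
results = [pe.accumulator for pe in array]
```

Here the enabled elements (initial values 0 and 2) apply the broadcast operation while the masked elements keep their data unchanged.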

The Processor Array unit 20 contains all the 
processing elements 10 and the supporting inter- 
connection network 14. Each Processing Element 
10 in the Processor Array 20 has direct access to 
its local column of memory within the architecture's 
memory space 23. Due to this distributed memory 
organization, memory conflicts are eliminated 
which consequently simplifies both the hardware 
and the software designs. 

In the present architecture, the Processing Element
10 makes up the computational engine of the
system. As mentioned above, the Processing Ele- 
ments 10 are part of the Processor Array 20 sub- 
system and all receive the same instruction stream, 
but perform the required operations on their own 
local data stream. Each Processing Element 10 is 
comprised of a number of Functional Units 24, a 
small register file 26, interprocessor communication 
ports 28, a shift register (S/R) 29, and a mask flag
30 as illustrated in FIG. 2. 

In addition to supplying memory address and 
control instructions to the Processor Array 20, each 
instruction word contains a specific field to control 
the loading and shifting of data into the memory 
address modifying register 29. This field is used 
when the memory address supplied by the instruc- 
tion needs to be uniquely modified based on some 
local information in each processor 10, as in the 
case of a table lookup. 

A novel feature of a computer architecture embodying
the present invention is its hardware support
mechanism for implementing indirect addressing
or a variable accuracy lookup table in the SIMD
architecture.

Neural network models are a practical example
of the use of the present invention in a SIMD
architecture. Such neural network models use a
variety of non-linear transfer functions such as the
sigmoid, the ramp, and the threshold functions.
These functions can be efficiently implemented
through the use of a lookup table. Implementation
of a table lookup mechanism on a SIMD architecture
requires a method for generation/modification
of the memory address supplied by the controller
18, based on some local value in each Processing
Element 10.
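To make the lookup-table idea concrete, the following Python sketch builds a sigmoid table with 2^k entries and maps a local value to a k-bit table index. The input range of -8 to 8, the midpoint sampling, and the function names are assumptions chosen for illustration; the text does not prescribe how the table is populated.

```python
import math


def build_sigmoid_table(k, low=-8.0, high=8.0):
    """Return a table of 2**k sigmoid samples over [low, high) (assumed range),
    each taken at the midpoint of its quantization interval."""
    size = 1 << k
    step = (high - low) / size
    return [1.0 / (1.0 + math.exp(-(low + (i + 0.5) * step))) for i in range(size)]


def quantize(x, k, low=-8.0, high=8.0):
    """Map a local value x to a k-bit table index, clamping out-of-range inputs."""
    size = 1 << k
    index = int((x - low) / (high - low) * size)
    return min(max(index, 0), size - 1)


# A 4-bit table: 16 entries, so each PE needs a 4-bit offset into it.
table = build_sigmoid_table(4)
index = quantize(0.0, 4)
approximation = table[index]
```

A larger k gives a finer quantization of the transfer function at the cost of a larger table, which is why the number of offset bits is worth making variable.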

A prior art architecture named BLITZEN, developed
by D.W. Blevins, E.W. Davis, R.A. Heaton
and J.H. Reif and described in their article titled
"BLITZEN: A Highly Integrated Massively Parallel
Machine," in the Journal of Parallel and Distributed
Computing (1990), Vol. 8, pp. 150-160, performs
this task by logically ORing the 10 most significant
bits of the memory address supplied by the controller
with a local register value. Such a scheme
does not offer the flexibility required for
general-purpose neurocomputer design. The accuracy,
or level of quantization, of the neuron output
values tolerated by neural networks can vary significantly
(from 2 to 16 bits) among different neural
network models and different applications of each
model.

In order to accommodate lookup tables of varying
sizes, an architecture embodying the present
invention incorporates two shift registers 44, 46 in
FIG. 3 (shift register 44 in FIG. 3 is the equivalent
of shift register 29 in FIG. 2) that are used to
modify the address supplied by the controller 18.
One shift register 44 is associated with the Processing
Element 10 and keeps the data value used
for addressing the lookup table. The other shift
register 46 is associated with the Processing Element's
local memory 38 and is used to modify the
address received from the controller 18. See FIG. 3.

The table lookup procedure for a table of size 2^k
is initiated when the controller 18 loads the base
address of the table to each of the shift registers
46 associated with each Processing Element's local
memory 38 using a broadcast instruction. The base
address value is right shifted by k bits before being
broadcast by the controller 18. This will ensure that
the proper value is being used after the augmentation
of the k bit offset value. The offset value is
then shifted into this register 46 one bit at a time
from the local register 44 in the Processing Element
10, starting from the most significant bit, into
the least significant bit of the memory address
register 46. The control signals for this shifting
operation are generated by the controller 18 and
are broadcast to all Processing Elements 10 as
part of the microinstruction word. With this procedure,
an address for a table of size 2^k can be
generated in k time steps by each processor. By
using a bitwise shifting operation, variable accuracy
can be achieved in accessing data from memory
array 22.
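The bit-serial procedure can be traced with a small Python model. This is a sketch under stated assumptions: the function name, the 16-bit width of register 46, and the example base address are hypothetical, and the model follows only the steps given in the text (broadcast the base right shifted by k, then shift in k offset bits, most significant bit first).

```python
def shift_in_offset(base_address, offset, k, address_bits=16):
    """Model the k-step transfer from PE register 44 into memory register 46.

    address_bits is an assumed register width; the text leaves it open.
    """
    address_mask = (1 << address_bits) - 1
    offset_mask = (1 << k) - 1

    reg46 = (base_address >> k) & address_mask  # controller broadcasts base >> k
    reg44 = offset & offset_mask                # local k-bit offset held in the PE

    for _ in range(k):                          # one broadcast shift step per bit
        msb = (reg44 >> (k - 1)) & 1            # most significant bit leaves reg 44
        reg44 = (reg44 << 1) & offset_mask
        reg46 = ((reg46 << 1) | msb) & address_mask  # and enters reg 46 at the LSB
    return reg46


# Every PE receives the same broadcast base but shifts in its own local value.
table_base = 0x0100
local_values = [0b0000, 0b0101, 0b1111]
addresses = [shift_in_offset(table_base, value, 4) for value in local_values]

# The same hardware addresses a smaller 4-entry table by issuing only 2 shifts.
coarse_address = shift_in_offset(table_base, 0b11, 2)
```

Note that in this model the final address equals the base plus the offset only when the base is a multiple of 2^k, since the low k bits of the base are discarded by the initial right shift.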

If variable accuracy retrieval is not necessary,
the present invention may be implemented by a
parallel transfer of the contents of the local
register 44 in the Processing Element 10 into
register 46. The advantage of the parallel transfer
of data between these two registers over the bitwise
serial shifting scheme is that only a single cycle of
the architecture is needed. However, it requires more
physical pins and wires for interconnection of the
physical chips comprising the various functional
components of the architecture and provides only a
fixed accuracy into the desired table held in memory.

Similarly, the bitwise shifting scheme described
above has the advantages over the parallel transfer
scheme that it requires only a single pin and
wire, and provides variable accuracy into the desired
memory table. However, it requires more
machine cycles to shift out the register contents in
a bitwise fashion than the parallel transfer of the
alternate scheme.
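The trade-off between the two schemes can be summarized in a simple cost model. This is an illustrative sketch only: the function name and the 16-bit register width are assumptions, and the counts reflect just the qualitative statements above (k cycles and one wire for the serial scheme, one cycle and one wire per register bit for the parallel scheme).

```python
def transfer_cost(scheme, k, register_bits=16):
    """Return the cycles and interconnect wires needed to move an offset
    from register 44 to register 46 (register_bits is an assumed width)."""
    if scheme == "serial":
        # One bit per cycle over a single wire; k may vary per lookup.
        return {"cycles": k, "wires": 1}
    if scheme == "parallel":
        # All bits at once; the width is fixed by the wiring, so k is ignored.
        return {"cycles": 1, "wires": register_bits}
    raise ValueError(f"unknown scheme: {scheme}")


serial_cost = transfer_cost("serial", k=8)
parallel_cost = transfer_cost("parallel", k=8)
```

Under these assumptions an 8-bit lookup costs eight cycles over one wire in the serial scheme, against one cycle over sixteen wires in the parallel scheme.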

The invention described above is, of course,
susceptible to many variations, modifications and
changes, all of which are within the skill of the art.
It should be understood that all such variations,
modifications and changes are within the spirit and
scope of the invention and of the appended claims.
Similarly, it will be understood that Applicant intends
to cover and claim all changes, modifications
and variations of the example of the preferred
embodiment of the invention herein disclosed for
the purpose of illustration which do not constitute
departures from the spirit and scope of the present
invention.

Claims 

1. In a parallel processing architecture following 
the Single Instruction stream Multiple Data
stream execution paradigm where a controller 
element (18) is connected to at least one pro- 
cessing element (10) with a local memory (38) 
having a local memory address shift register 
(46) adapted to receive and retain therein a
globally broadcast memory base address value 
received from the controller element (18) for 
use by the processing element (10) for access 
and transfer of data between the processing 
element (10) and its respective local memory
(38), a computer architecture for implementing
indirect addressing and look-up tables comprising:

a processing element shift register (44) 
associated with the at least one processing 
element (10) and adapted to receive and retain 
therein a local memory offset address value 
calculated or loaded by the associated pro- 
cessing element (10) in accord with a first 
predetermined set of instructions, said pro- 
cessing element shift register (44) further 
adapted to transfer its contents bitwise to the 
local memory shift register (46) of the local 
memory (38) associated with the processing 
element (10), with the bit value of the most 
significant bit position being sequentially trans- 
ferred to the least significant bit position of the 
local memory shift register (46) in accord with 
a second predetermined set of instructions. 

2. In a parallel processing architecture following 
the Single Instruction stream Multiple Data 
stream execution paradigm where a controller 
element (18) is connected to at least one pro- 
cessing element (10) with a local memory (38) 
having a local memory address shift register 
(46) adapted to receive and retain therein a 
globally broadcast memory base address value 
received from the controller element (18) for 
use by the processing element (10) for access 
and transfer of data between the processing 
element (10) and its respective local memory 
(38), a computer architecture for implementing 
indirect addressing and look-up tables com- 
prising: 

a processing element shift register (44) 
associated with at least one processing ele- 
ment (10) and adapted to receive and retain 
therein a local memory address value calcu- 
lated or loaded by the associated processing 
element (10) in accord with a first predeter- 
mined set of instructions, said processing ele- 
ment shift register (44) further adapted to 
transfer its contents to the local memory shift 
register (46) of the local memory (38) asso- 
ciated with the processing element (10), in a 
parallel transfer of bits between the two regis- 
ters (44, 46). 

3. A computer architecture for implementing in- 
direct addressing and look-up tables in a par- 
allel processing architecture following the Sin- 
gle Instruction stream Multiple Data stream ex- 
ecution paradigm, the architecture comprising: 

a controller element (18) connected to at 
least one processing element (10), said pro- 
cessing element (10) associated with a local 
memory (38) having a local memory address 
shift register (46) adapted to receive and retain 
therein a globally broadcast memory base register
address value received from said controller
element (18) for use by said processing
element (10) for access and transfer of data 
between the processing element (10) and its 
respective local memory (38), 

said at least one processing element (10)
further associated with a processing element
shift register (44) adapted to receive and retain
therein a local memory offset address value
calculated or loaded by said processing element
(10) in accord with a first predetermined
set of instructions, said processing element
shift register (44) further adapted to transfer its
contents bitwise to said local memory shift
register (46) of said local memory (38) associated
with the processing element (10), with
the bit value of the most significant bit position
being sequentially transferred to the least significant
bit position of said local memory shift
register (46) in accord with a second predetermined
set of instructions.

4. In a computer system having a controller (18)
connected to a plurality of processing elements
(10), and associated with each processing
element a local memory (38) with a local
memory shift register (46) for access and
transfer of data between the associated processing
element (10) and its associated local
memory (38), a system for implementing indirect
addressing and look-up tables in a parallel
processing architecture following the Single
Instruction stream Multiple Data stream execution
paradigm, the architecture comprising:

a plurality of processing element shift registers
(44), each associated with a respective
one of the processing elements (10) and each
adapted to receive and retain therein a local
memory offset address value calculated by
each associated processing element (10) in
accord with a first predetermined set of
instructions, each of said processing element
shift registers (44) further adapted to transfer
its contents bitwise to the local memory shift
register (46) of the local memory (38) associated
with the processing element (10) with
the bit value of the most significant bit position
being sequentially transferred to the least significant
bit position of the local memory shift
register (46) in accord with a second predetermined
set of instructions.